DTSA 5511 Introduction to Machine Learning: Deep Learning

Week 4: Natural Language Processing with Disaster Tweets

Author
Affiliation

Andrew Simms

University of Colorado Boulder

Published

December 1, 2024

1 Problem Description

This project builds a Recurrent Neural Network (RNN) model for the Natural Language Processing with Disaster Tweets competition, hosted on Kaggle (Howard et al. 2019), with the objective of developing a machine learning model that can accurately classify tweets as disaster-related or not. The dataset consists of 10,000 manually labeled tweets, creating a binary classification task where tweets are labeled 1 for disaster-related content and 0 for non-disaster-related content.

To achieve this goal, this project will leverage PyTorch (Ansel et al. 2024) to design and implement a Recurrent Neural Network (RNN), a neural architecture well-suited for processing sequential text data. The trained model will generate predictions that will be submitted to Kaggle for evaluation.

This project will address the following research questions:

Table 1: Project Research Questions
Research Area Question
Data Preparation How should text data be preprocessed to maximize model performance?
Model Building How do we implement an RNN model in PyTorch?
Hyperparameter Tuning What static hyperparameters should be defined and what dynamic hyperparameters should be tuned?
Model Performance What performance metrics should be used?
How do the models perform during training, validation, and testing?
Improvement Strategies What methods can be used to further enhance model performance?

Beyond answering these questions, the project aims to address the technical challenges related to RNN models, including mitigating overfitting, handling exploding gradients, and balancing model complexity with prediction accuracy.
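Among the listed challenges, exploding gradients have a standard mitigation: clipping the gradient norm after backpropagation. A minimal sketch of the idea (the toy linear model, the loss scaling, and the `max_norm` value are illustrative assumptions, not this project's actual training loop):

```python
import torch
import torch.nn as nn

# Toy model standing in for the RNN; the 1000x scale inflates gradients
# to imitate an exploding-gradient step.
model = nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).pow(2).mean() * 1000
loss.backward()

# Rescale all gradients so their global L2 norm is at most max_norm.
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

After the call, the combined gradient norm across all parameters is bounded by `max_norm`, regardless of how large the raw gradients were.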

The workflow for this research is summarized in Figure 1. The process begins with exploratory data analysis and preprocessing, followed by model training. Hyperparameter tuning is iteratively applied to refine model performance, culminating in the development of final models for Kaggle submission.

flowchart LR
    EDA["<div style='line-height:1.0;'>Exploratory<br>Data<br>Analysis</div>"]
    --> Clean["<div style='line-height:1.0;'>Clean<br>Original<br>Data</div>"]
    --> BuildModel["<div style='line-height:1.0;'>Build<br>RNN<br>Model</div>"]
    --> Train["<div style='line-height:1.0;'>Train<br>Model</div>"]
    --> Tune["<div style='line-height:1.0;'>Tune<br>Hyperparameters</div>"]
    --> OutputFinal["<div style='line-height:1.0;'>Final<br>Models</div>"]
    --> Submit["<div style='line-height:1.0;'>Submit<br>Results</div>"]
    Tune --> Train
Figure 1: RNN Project Workflow

2 Exploratory Data Analysis

For the defined binary classification task, the supplied training and test data contain the tweet text, a location, a keyword, and an id; the training data additionally includes a class label. Each training tweet is labeled as either disaster-related (1) or not (0). The primary input feature is text (the original tweet content), supplemented by the keyword and location fields.

2.1 Training Data Columns and Types

Code
import pandas as pd

train_df = pd.read_csv("../data/train.csv")
train_df.info()
Table 2: Data Columns and Types of Training Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB

Table 2 details the available data. Of note are the large number of missing values in the location column and the small number missing in the keyword column. id and target are integers while the other three columns are text.

2.2 Training Data Sample

Code
train_df.loc[train_df['location'].notna()].head()
Table 3: Sample of Training Data
id keyword location text target
31 48 ablaze Birmingham @bbcmtd Wholesale Markets ablaze http://t.co/l... 1
32 49 ablaze Est. September 2012 - Bristol We always try to bring the heavy. #metal #RT h... 0
33 50 ablaze AFRICA #AFRICANBAZE: Breaking news:Nigeria flag set a... 1
34 52 ablaze Philadelphia, PA Crying out for more! Set me ablaze 0
35 53 ablaze London, UK On plus side LOOK AT THE SKY LAST NIGHT IT WAS... 0

Table 3 outputs the contents of a subset of the data. Data in the keyword column appears to be somewhat standardized while data in the location and text columns appear to be original inputs from the user.

2.3 Distribution of Target Values

Binary classification training data ideally has a roughly equal balance of positive and negative examples. The actual balance is checked by counting the occurrences of each target value.

Code
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()


plt.figure(figsize=(7, 2.5))
sns.countplot(x='target', data=train_df)
plt.xlabel("Target")
plt.ylabel("Count")
plt.title("Count of Target Values")
plt.show()
Figure 2: Histogram of target Values in Training Data

Figure 2 shows an unequal number of examples in each class, with negative (non-disaster) tweets more common. This imbalance is important and must be accounted for during model training and validation.
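One common way to account for such an imbalance is to weight the loss by inverse class frequency. A minimal sketch, assuming approximate class counts of 4,342 negative and 3,271 positive read off Figure 2:

```python
import pandas as pd

# Hypothetical target column mirroring the imbalance seen in Figure 2.
targets = pd.Series([0] * 4342 + [1] * 3271)
counts = targets.value_counts()

# Inverse-frequency class weights: weight_c = N / (n_classes * count_c),
# so the rarer positive class receives the larger weight.
weights = len(targets) / (2 * counts)
```

These weights could then be passed to a weighted loss so that misclassifying the rarer class is penalized more heavily.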

2.4 Sample of Positive Tweets

Code
train_df.loc[train_df['target'] == 1].head()
Table 4: Sample of Positive Training Data
id keyword location text target
0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... 1
1 4 NaN NaN Forest fire near La Ronge Sask. Canada 1
2 5 NaN NaN All residents asked to 'shelter in place' are ... 1
3 6 NaN NaN 13,000 people receive #wildfires evacuation or... 1
4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... 1

In Table 4, positive tweets appear to have some relation to the disaster they are describing. The content appears to have multiple complex words and hashtags.

2.5 Sample of Negative Tweets

Code
train_df.loc[train_df['target'] == 0].head()
Table 5: Sample of Negative Training Data
id keyword location text target
15 23 NaN NaN What's up man? 0
16 24 NaN NaN I love fruits 0
17 25 NaN NaN Summer is lovely 0
18 26 NaN NaN My car is so fast 0
19 28 NaN NaN What a goooooooaaaaaal!!!!!! 0

In Table 5, negative tweets have content that appears irrelevant to a disaster. This content appears relatively generic and not specific to any event or location.

2.6 Word Count

To identify whether data cleaning is necessary, the content of the tweets (text column) is visualized below. Each row is split into words on whitespace, and the combined word list is counted.

Code
import numpy as np
from collections import Counter

def plot_word_horizontal_bar_chart(df, column, top_n=10, figsize=None):
    """
    Plot a horizontal bar chart of the most common words in a column.

    :param df: The input DataFrame.
    :param column: The name of the column to analyze.
    :param top_n: Number of most common words to display.
    :param figsize: Optional figure size; defaults to (3, 6).
    """
    # Count word frequencies
    words = dataframe_to_word_list(df, column)
    word_counts = Counter(words)
    most_common = word_counts.most_common(top_n)

    # Split words and their counts
    labels, counts = zip(*most_common)

    # Plot the horizontal bar chart
    if figsize is None:
        figsize = (3, 6)
    plt.figure(figsize=figsize)
    plt.barh(labels, counts)
    plt.xlabel("Count")
    plt.title(f"Top {top_n} Words in {column}")
    plt.gca().invert_yaxis()  # Display the highest count at the top
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

def dataframe_to_word_list(df, text_column):
    """
    Convert a DataFrame column of text into a list of words.

    :param df: The input DataFrame.
    :param text_column: The name of the column containing text data.
    :return: A list of words.
    """
    # Tokenize each row into words and flatten into a single list
    words = df[text_column].str.split().explode().tolist()
    return [word for word in words if isinstance(word, str)]
Code
plot_word_horizontal_bar_chart(train_df, 'text', top_n=25)
plot_word_horizontal_bar_chart(train_df.loc[train_df['location'].notna()], 'location', top_n=25)
plot_word_horizontal_bar_chart(train_df, 'keyword', top_n=25)
(a) text Column
(b) location Column
(c) keyword Column
Figure 3: Original Data Word Count by Column

The word count histograms in Figure 3 reveal a mixture of relevant and possibly unnecessary words and characters in each column. Additionally, some characters may be removed to clean up the input to the model.

3 Data Cleaning

Based on the word count visualizations in Figure 3, it is evident that removing common stop words may have an effect on the model’s performance. This process will be implemented using the NLTK Python package (Bird, Klein, and Loper 2009). To allow the data-cleaning level to be treated as a hyperparameter and to track different preprocessing configurations, the cleaned datasets will be labeled with an a<count> suffix, where the first cleaned dataset is designated a1.

3.1 Data Level a1: Removing Stop Words

Code
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def filter_stop_words(df, column):
    def filter_row(word_list):
        """Filter stop words from a whitespace-separated string."""
        words = str(word_list).split()
        words = [word for word in words if word != "nan"]
        return " ".join([word for word in words if word.lower() not in stop_words])

    # Apply the filtering function to the specified column
    df[column] = df[column].apply(filter_row)

    return df

train_df_a1 = train_df.copy()

train_df_a1 = filter_stop_words(train_df_a1, 'text')
train_df_a1 = filter_stop_words(train_df_a1, 'location')
train_df_a1 = filter_stop_words(train_df_a1, 'keyword')
Code
plot_word_horizontal_bar_chart(train_df_a1, 'text', top_n=25)
plot_word_horizontal_bar_chart(train_df_a1, 'location', top_n=25)
plot_word_horizontal_bar_chart(train_df_a1, 'keyword', top_n=25)
(a) text Column
(b) location Column
(c) keyword Column
Figure 4: A1 Data Word Count by Column

3.2 Data Level a2: Removing Unnecessary Characters

It may be beneficial to further clean the data. We apply some high-level techniques to normalize the input data.

Code
import re

def clean_df_text_column(df, column):
    def clean_row(tweet):
        words = tweet.split()
        cleaned_words = []
        for word in words:
            # Remove URLs
            # word = re.sub(r'http\S+|www\S+', '[URL]', word)

            # Replace user mentions (@username) with a placeholder
            # word = re.sub(r'@\w+', '[USER]', word)

            # Remove hashtags but keep the word (e.g., "#earthquake" → "earthquake")
            # word = re.sub(r'#(\w+)', r'\1', word)

            # Remove unwanted characters (e.g., punctuation)
            word = re.sub(r'[^\w\s]', '', word)

            # Remove dashes
            word = re.sub('-', '', word)

            # Remove extra spaces (if any remain)
            word = word.strip()

            # Add the cleaned word to the list if it's not empty
            if word and len(word) > 1:
                cleaned_words.append(word.lower())

        return " ".join(cleaned_words)

    df[column] = df[column].apply(clean_row)

    return df

train_df_a2 = train_df_a1.copy()

train_df_a2 = clean_df_text_column(train_df_a2, 'text')
train_df_a2 = clean_df_text_column(train_df_a2, 'location')
train_df_a2 = clean_df_text_column(train_df_a2, 'keyword')
Code
plot_word_horizontal_bar_chart(train_df_a2, 'text', top_n=25)
plot_word_horizontal_bar_chart(train_df_a2, 'location', top_n=25)
plot_word_horizontal_bar_chart(train_df_a2, 'keyword', top_n=25)
(a) text Column
(b) location Column
(c) keyword Column
Figure 5: A2 Data Word Count by Column

3.3 Final Cleaning Results

Final cleaning results from data level a2 are detailed below.

3.3.1 Text Content

To determine if there are differences between cleaned positive and negative tweets a sample of randomly selected tweets from each class is output below.

3.3.1.1 Positive Tweets

Code
train_df_a2[
    (train_df_a2['target'] == 1) &
    (train_df_a2['location'].str.len() > 1) &
    (train_df_a2['keyword'].str.len() > 1)
].sample(frac=1.0).head()
Table 6: Sample of Cleaned Positive Training Data
id keyword location text target
4329 6148 hijack nigeria criminals hijack lorries buses arrested enugu ... 1
5085 7252 nuclear20disaster netherlands fukushimatepco fukushima nuclear disaster incr... 1
3871 5503 flames santo domingo alma rosa soloquiero maryland mansion fire killed caused... 1
6221 8880 smoke ktx get smoke shit peace 1
5710 8147 rescuers iminchina video were picking bodies water rescuers searc... 1

3.3.1.2 Negative Tweets

Code
train_df_a2[
    (train_df_a2['target'] == 0) &
    (train_df_a2['location'].str.len() > 1) &
    (train_df_a2['keyword'].str.len() > 1)
].sample(frac=1.0).head()
Table 7: Sample of Cleaned Negative Training Data
id keyword location text target
5997 8562 screams gladiator û860û757û casually phone jasmine cries screams spider 0
338 485 armageddon flightcity uk official vid gt doublecups gtgt httpstcolfkmtz... 0
5682 8109 rescued bournemouth finnish hip hop pioneer paleface rescued drift... 0
2598 3727 destroyed waco texas always felt like namekians black people felt p... 0
3510 5017 eyewitness rhode island wpri 12 eyewitness news rhode island set moder... 0

In the randomly selected tweets there are some differences between the two classes, but after cleaning the differences are not readily apparent from a content perspective. Both positive and negative tweets contain some text that is not readily comprehensible.

3.3.2 Visualizations

Code
def count_unique_words(input_df, column):
    # Create a set to store unique words
    unique_words = set()

    # Iterate through each row in the column
    for text in input_df[column]:
        if isinstance(text, str):  # Ensure the entry is a string
            words = text.split()  # Split into words
            unique_words.update(words)  # Add words to the set

    # Return the size of the set
    return len(unique_words)
Code
results = [
    {
        "class": level,
        "column": column,
        "count": count_unique_words(level_df, column),
    }
    for column in ["text", "keyword", "location"]
    for level, level_df in [
        ("original", train_df),
        ("a1", train_df_a1),
        ("a2", train_df_a2),
    ]
]

results_df = pd.DataFrame(results)

results_df = results_df.rename({
    "class": "Data Level",
    "count": "Count",
    "column": "Data Column",
}, axis="columns")

def plot_count_hist(input_df, column):
    plt.figure(figsize=(4, 3))

    # Create the barplot
    ax = sns.barplot(
        data=input_df.loc[input_df["Data Column"] == column],
        y="Count",
        x="Data Column",
        hue="Data Level",
    )

    # Customize the legend position to appear below the plot
    plt.legend(
        title="Data Level",  # Optional: Add a title to the legend
        loc="upper center",  # Center the legend horizontally
        bbox_to_anchor=(0.5, -0.25),  # Adjust the vertical position below the plot
        ncol=3,  # Display the legend in three columns (optional for compactness)
        frameon=False,  # Remove the legend border (optional)
    )

    plt.xlabel(None)
    plt.tight_layout()  # Adjust layout to prevent overlap
    plt.show()
Code
plot_count_hist(results_df, "text")
Figure 6: Count of unique values in text throughout cleaning process
Code
plot_count_hist(results_df, "location")
Figure 7: Count of unique values in location throughout cleaning process
Code
plot_count_hist(results_df, "keyword")
Figure 8: Count of unique values in keyword throughout cleaning process

In Figure 6 the number of unique values decreases at each processing step. For the text column, removing stop words slightly reduces the unique-value count, while the character cleaning removes a significant number of values. location in Figure 7 follows a similar pattern, but the number of words removed by stop-word filtering is higher than for text. keyword in Figure 8 shows no change across cleaning levels, suggesting that this field is already standardized in the original data.

Table 8: Data Level Descriptions
Data Level Description
Original Original data without modifications
a1 Stop words removed
a2 a1 plus lowercasing, whitespace stripping, punctuation removal, and removal of words shorter than two characters
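The a2 rules in Table 8 can be traced on a single hypothetical tweet:

```python
import re

# Hypothetical raw tweet; each word passes through the a2 cleaning rules.
tweet = "Forest fire near La-Ronge, Sask. #WILDFIRE!"

cleaned = []
for word in tweet.split():
    word = re.sub(r'[^\w\s]', '', word)  # strip punctuation (incl. '#', '-', ',')
    word = word.strip()                  # drop any leftover whitespace
    if word and len(word) > 1:           # drop empty and single-character tokens
        cleaned.append(word.lower())     # lowercase
result = " ".join(cleaned)
# result == "forest fire near laronge sask wildfire"
```

Note that stripping punctuation merges hyphenated words ("La-Ronge" becomes "laronge") and reduces hashtags to their bare word.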

We will now process the training and test data using the above functions and save them to Parquet files as input for testing:

Code
from pathlib import Path
data_path = Path("../data/preprocessed").resolve()
data_path.mkdir(exist_ok=True, parents=True)

train_data_path = Path(data_path, "train")
train_data_path.mkdir(exist_ok=True)
train_raw_filename = Path(train_data_path, "train_raw.parquet")
train_a1_filename = Path(train_data_path, "train_a1.parquet")
train_a2_filename = Path(train_data_path, "train_a2.parquet")

train_df.to_parquet(train_raw_filename)
train_df_a1.to_parquet(train_a1_filename)
train_df_a2.to_parquet(train_a2_filename)

test_data_path = Path(data_path, "test")
test_data_path.mkdir(exist_ok=True)
test_raw_filename = Path(test_data_path, "test_raw.parquet")
test_a1_filename = Path(test_data_path, "test_a1.parquet")
test_a2_filename = Path(test_data_path, "test_a2.parquet")

test_df = pd.read_csv("../data/test.csv")
test_df_a1 = test_df.copy()
test_df_a1 = filter_stop_words(test_df_a1, 'text')
test_df_a1 = filter_stop_words(test_df_a1, 'location')
test_df_a1 = filter_stop_words(test_df_a1, 'keyword')

test_df_a2 = test_df_a1.copy()
test_df_a2 = clean_df_text_column(test_df_a2, 'text')
test_df_a2 = clean_df_text_column(test_df_a2, 'location')
test_df_a2 = clean_df_text_column(test_df_a2, 'keyword')

test_df.to_parquet(test_raw_filename)
test_df_a1.to_parquet(test_a1_filename)
test_df_a2.to_parquet(test_a2_filename)

3.4 Tokenization

Another preprocessing step is converting the text into a numerical representation that a neural network can work with. This can be implemented using multiple methods; one popular library is the Transformers Python library developed by Wolf et al. (2022). As detailed by AkaraAsai on GitHub, there are many options for pretrained tokenizers. For this project we will use the bert-base-uncased and bert-base-cased tokenizers via the PreTrainedTokenizer class in the transformers library.

Listing 1: Tokenizer Data Loader Implementation
def preprocess_dataframe(
    df: pd.DataFrame,
    tokenizer: PreTrainedTokenizer,
    text_max_length: int,
    keyword_max_length: int,
    location_max_length: int,
):
    ids, tokens, attentions, targets = [], [], [], []

    max_length = text_max_length + keyword_max_length + location_max_length

    df["keyword"] = df["keyword"].fillna("")
    df["location"] = df["location"].fillna("")

    for _, row in df.iterrows():
        # Tokenize each component and add special tokens to indicate type
        text_tokens = tokenizer.encode(
            row["text"],
            add_special_tokens=True,
            truncation=True,
            max_length=text_max_length,
            padding="max_length",
            return_tensors="pt",
        ).tolist()[0]
        keyword_tokens = tokenizer.encode(
            row["keyword"],
            add_special_tokens=True,
            truncation=True,
            max_length=keyword_max_length,
            padding="max_length",
            return_tensors="pt",
        ).tolist()[0]
        location_tokens = tokenizer.encode(
            row["location"],
            add_special_tokens=True,
            truncation=True,
            max_length=location_max_length,
            padding="max_length",
            return_tensors="pt",
        ).tolist()[0]

        # Combine tokens
        combined_tokens = text_tokens + keyword_tokens + location_tokens

        # Create initial attention mask with 1s for all tokens
        attention_mask = [1] * len(combined_tokens)

        # Pad tokens and attention mask to max_length
        padding_length = max_length - len(combined_tokens)
        combined_tokens += [tokenizer.pad_token_id] * padding_length
        attention_mask += [0] * padding_length

        # Update attention mask to set positions with padding tokens to 0
        attention_mask = [
            0 if token == tokenizer.pad_token_id else mask
            for token, mask in zip(combined_tokens, attention_mask)
        ]

        # Collect processed data
        ids.append(row["id"])
        tokens.append(combined_tokens)
        attentions.append(attention_mask)
        if "target" in row:
            targets.append(row["target"])

    # Return targets if they exist
    if len(targets) > 0:
        return pd.DataFrame(
            {"id": ids, "tokens": tokens, "attention": attentions, "target": targets}
        )
    else:
        return pd.DataFrame({"id": ids, "tokens": tokens, "attention": attentions})
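The pad-and-mask step at the heart of Listing 1 can be exercised in isolation with plain lists; the token ids below and the pad token id of 0 are illustrative assumptions:

```python
def pad_and_mask(combined_tokens, max_length, pad_token_id=0):
    """Pad token ids to max_length and build a matching attention mask,
    mirroring the final steps of the preprocessing loop in Listing 1."""
    attention_mask = [1] * len(combined_tokens)
    padding_length = max_length - len(combined_tokens)
    combined_tokens = combined_tokens + [pad_token_id] * padding_length
    attention_mask = attention_mask + [0] * padding_length
    # Zero the mask wherever a pad token id appears.
    attention_mask = [
        0 if token == pad_token_id else mask
        for token, mask in zip(combined_tokens, attention_mask)
    ]
    return combined_tokens, attention_mask

tokens, mask = pad_and_mask([101, 2054, 102], max_length=6)
# tokens == [101, 2054, 102, 0, 0, 0]; mask == [1, 1, 1, 0, 0, 0]
```

The mask tells downstream layers which positions carry real content (1) and which are padding (0).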

4 Recurrent Neural Network Model

To solve the disaster tweet classification problem, a RNN model was implemented using PyTorch (Ansel et al. 2024). This model is specifically designed to process and classify sequential text data while also leveraging additional features, such as keyword and location, to improve accuracy. The architecture has the following key components and characteristics:

4.1 Embedding Layers

The model uses three separate embedding layers (text_embedding, keyword_embedding, and location_embedding) to convert the categorical input (text, keyword, and location) into dense, low-dimensional vectors.

  • Embedding Dimension: The size of these dense representations is a tunable hyperparameter (embedding_dim, default: 128).
  • Padding Index: A padding index of 0 ensures uniform sequence lengths for inputs with varying sizes.
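The embedding lookup can be sketched with a toy table; the vocabulary size and dimension here are illustrative, not the model's defaults:

```python
import torch
import torch.nn as nn

# Small embedding table: 100-word vocabulary, 8-dimensional vectors.
emb = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=0)

ids = torch.tensor([[5, 17, 0, 0]])  # one sequence of length 4, padded with index 0
out = emb(ids)                       # shape: (batch, seq_len, embedding_dim)
```

Because of `padding_idx=0`, the rows for padded positions are zero vectors and contribute nothing to the representation.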

4.2 Recurrent Layer (LSTM)

The embeddings are concatenated into a combined representation, which serves as input to an LSTM (Long Short-Term Memory) layer, first described by Hochreiter and Schmidhuber (1997). For this project the input dimension can be tuned as a hyperparameter:

  • Input Dimensions
    • The concatenated embeddings have a dimensionality of Embedding Dimension × 3, as all three embeddings are combined.
  • Hidden Dimension: The LSTM layer processes the input and outputs a hidden state of size hidden_dim. To reduce the number of hyperparameters, this value is fixed at Input Dimension × 2.
  • Batch Processing: The batch_first=True argument ensures the input tensors are structured as (batch size, sequence length, feature size).
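These shape conventions can be checked with a small standalone LSTM; the embedding dimension of 8 is illustrative, and the hidden size follows the Input Dimension × 2 rule above:

```python
import torch
import torch.nn as nn

embedding_dim = 8  # illustrative size
input_dim = embedding_dim * 3           # three concatenated embeddings
lstm = nn.LSTM(input_dim, input_dim * 2, batch_first=True)

x = torch.randn(2, 10, input_dim)       # (batch, seq_len, features)
out, (h, c) = lstm(x)
last_hidden_state = out[:, -1, :]       # last time step, fed to the fc layer
```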

4.3 Fully Connected Layer

The final hidden state of the LSTM, corresponding to the last time step, is passed through a fully connected (fc) layer to reduce the dimensionality to the output size of 1.

4.4 Sigmoid Activation

The output of the fully connected layer is passed through a sigmoid activation function, which scales the predictions to a range of [0, 1]. These outputs represent the probability of a tweet being disaster-related. For validation or testing, these probabilities are thresholded (e.g., ≥ 0.5) to classify the output as either 1 or 0.
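The activation and thresholding steps can be traced numerically; the logit values below are illustrative:

```python
import math

def sigmoid(x):
    """Logistic function: maps any real logit into (0, 1)."""
    return 1 / (1 + math.exp(-x))

logits = [2.2, -1.5, 0.0]
probs = [sigmoid(z) for z in logits]
# Threshold at 0.5 to obtain hard class labels.
labels = [1 if p >= 0.5 else 0 for p in probs]
# labels == [1, 0, 1]
```

A logit of exactly 0 maps to a probability of 0.5 and is classified as positive under the ≥ 0.5 rule.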

4.5 RNNWithMultiInput Class Definition

Listing 2: RNN Pytorch Model
import torch
import torch.nn as nn

class RNNWithMultiInput(nn.Module):
    def __init__(
        self,
        vocab_size,
        use_attention=False,
        embedding_dim=128,
        hidden_dim=256,
        output_dim=1,
    ):
        super(RNNWithMultiInput, self).__init__()
        self.use_attention = use_attention
        self.text_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.keyword_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.location_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        self.lstm = nn.LSTM(
            embedding_dim * 3,
            hidden_dim,
            batch_first=True,
        )
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(
        self,
        text_input_ids,
        text_attention_mask,
        keyword_input_ids,
        keyword_attention_mask,
        location_input_ids,
        location_attention_mask,
    ):
        # Embedding layers
        text_emb = self.text_embedding(text_input_ids)
        keyword_emb = self.keyword_embedding(keyword_input_ids)
        location_emb = self.location_embedding(location_input_ids)

        if self.use_attention:
            text_emb = text_emb * text_attention_mask.unsqueeze(-1)
            keyword_emb = keyword_emb * keyword_attention_mask.unsqueeze(-1)
            location_emb = location_emb * location_attention_mask.unsqueeze(-1)

        # Combine embeddings
        combined_emb = torch.cat((text_emb, keyword_emb, location_emb), dim=2)

        # Pass through LSTM
        lstm_out, _ = self.lstm(combined_emb)
        last_hidden_state = lstm_out[:, -1, :]

        # Fully connected layer
        logits = self.fc(last_hidden_state)
        return self.sigmoid(logits).squeeze()

4.6 Hyperparameter Tuning

Key hyperparameters for optimization include:

  • Embedding Dimension (embedding_dim): Controls the size of the feature space for text representation.
  • Hidden Dimension (hidden_dim): Determines the capacity of the LSTM layer to capture sequential patterns.
  • Batch Size and Learning Rate: While not part of the architecture, these parameters significantly influence training efficiency and model performance.

4.6.1 Attention Mechanism Hyperparameter

The model includes an optional attention mechanism (enabled by the use_attention flag). If enabled, attention masks are applied to the embeddings to emphasize relevant parts of the input sequences while ignoring padded elements.

4.7 Rationale for Architecture

This architecture is well-suited for this problem because:

  1. The LSTM layer efficiently captures sequential dependencies in textual data, which is critical for understanding the context within tweets.
  2. By incorporating separate embeddings for keyword and location, the model leverages additional information beyond the tweet text, potentially improving classification accuracy.
  3. The flexibility to enable or disable attention mechanisms provides adaptability for datasets with varying levels of noise or irrelevant data.

5 Training

The training process for this project begins with an 80/20 train-test split of the dataset, ensuring a robust and reliable evaluation of model performance. A critical consideration in training a neural network is the selection of an appropriate validation metric. Given the binary classification nature of the task—where tweets are labeled as disaster-related (1) or non-disaster-related (0)—traditional metrics such as accuracy, precision, recall, and F1 score must be carefully evaluated in the context of class imbalance.

For this dataset, F1 score is chosen as the primary evaluation metric. This decision is driven by the inherent class imbalance in the disaster-related tweets, where false positives (non-disaster tweets incorrectly classified as disasters) and false negatives (disaster tweets missed by the model) both carry significant consequences. The F1 score, being the harmonic mean of precision and recall, provides a balanced measure that accounts for both types of error, ensuring the model optimizes performance in a way that minimizes the impact of misclassifications.
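As a concrete illustration, the F1 score can be computed from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a validation split.
tp, fp, fn = 420, 80, 120

precision = tp / (tp + fp)  # of predicted disasters, how many were real
recall = tp / (tp + fn)     # of real disasters, how many were caught
f1 = 2 * precision * recall / (precision + recall)
# precision = 0.84, recall ≈ 0.778, f1 ≈ 0.808
```

Because the harmonic mean is dominated by the smaller of precision and recall, a model cannot score well on F1 by sacrificing one for the other.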

With the evaluation metric established, we proceed to the core aspects of the training process, which focus on optimizing the model’s performance. Several key data-driven questions guide this process:

  1. Baseline Determination: How does the model perform with default hyperparameter settings?
  2. Tokenizer Selection: Which tokenizer should be employed to preprocess the textual data effectively, accounting for nuances such as tokenization of hashtags, mentions, and special characters?
  3. Data Cleaning: What level of data preprocessing (e.g., removal of stopwords, punctuation, or special characters) is optimal for this task to ensure high-quality input data?
  4. Embedding Layers: What embedding dimension best balances the model’s ability to capture semantic information without increasing computational complexity unnecessarily?

To answer these questions, a series of models will be produced, each building on the results of the previous comparison. At each comparison step, multiple embedding dimensions will be modeled to see whether this affects the outcome.

The goal is to fine-tune these hyperparameters and preprocessing steps to strike a balance between model complexity and predictive accuracy, ensuring the best possible performance on unseen data. Following this optimization process, the top three models—based on their performance in training and validation—will be selected for final evaluation and submission to Kaggle.

Code
import duckdb

con = duckdb.connect()
query = "SELECT * FROM read_parquet('../train_stats_f1/*.parquet', union_by_name=True)"
df = con.execute(query).fetchdf()

df['Positive Ratio'] = (df['Validation True Positive'] + df['Validation False Positive']) / (df['Validation True Positive'] + df['Validation True Negative'] + df['Validation False Positive'] + df['Validation False Negative'])


df["Epoch"] = df["Epoch"].astype(int)
Code
def plot_vs_embedding_dim(df, y_col, hue="Embedding Dimensions", show_legend=True, ylim=None):
    plt.figure(figsize=(5.5, 3))
    sns.lineplot(df, x='Epoch', y=y_col, hue=hue, palette="deep", legend=show_legend)

    # Customize legend
    if show_legend:
        plt.legend(
            title="\n".join(hue.split(" ")),
            loc='center left',  # Adjust to the right of the plot
            bbox_to_anchor=(1, 0.5),  # Position to the right
            frameon=False  # Remove background and border
        )

    if ylim is not None:
        plt.ylim((0, ylim))

    plt.tight_layout()
    plt.show()

5.1 Baseline Models

5.1.1 Training Loss and F1 Score

Code
df_baseline = df.loc[
    (df["Comparison Type"] == 'Baseline')
]
Code
plot_vs_embedding_dim(
    df_baseline,
    "Training F1 Score",
    show_legend=False,
    ylim=1.0
)
plot_vs_embedding_dim(
    df_baseline,
    "Validation F1 Score",
    ylim=1.0,
)
(a) Training F1 Score
(b) Validation F1 Score
Figure 9: Baseline Model - Training and Validation F1 Score

5.1.2 Learning Rate and Compute Time

Code
plot_vs_embedding_dim(
    df_baseline,
    "Learning Rate",
    show_legend=False,
)
plot_vs_embedding_dim(
    df_baseline,
    "Compute Time",
)
(a) Learning Rate
(b) Compute Time Per Epoch [s]
Figure 10: Baseline Model - Learning Rate and Compute Time

Figure 9 (a) illustrates that across embedding dimensions, the training F1 score consistently improves with an increasing number of epochs. Models with lower embedding dimensions, however, converge to a lower final training F1 score, suggesting that these dimensions may limit the model’s capacity. Similarly, in Figure 9 (b), validation F1 scores also increase with training epochs. However, an upper limit is observed: embedding dimensions above 16 do not demonstrate a significant difference in validation performance.

Based on these trends, training for 50 epochs appears sufficient for models with embedding dimensions greater than 8 to achieve training stability and reach their performance plateau.

Figure 10 (a) visualizes the learning rate progression over training epochs. The learning rate is adjusted dynamically using a ReduceLROnPlateau scheduler based on validation F1 scores, as shown in the code snippet below:

Listing 3: Learning Rate Scheduler
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="max",
)
# ...
scheduler.step(val_f1)

Models with higher embedding dimensions tend to trigger the scheduler earlier, indicating that these models converge more rapidly. While this is advantageous for reducing the risk of overfitting, it does not necessarily translate into improved validation performance beyond the observed upper limit.
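The scheduler’s effect on the final learning rate can be reproduced by hand. With PyTorch’s defaults (factor=0.1, patience=10) and Adam’s default starting rate of 1e-3, each plateau of more than ten epochs without improvement cuts the rate by 10×, and three such reductions yield the 1e-6 end learning rate reported later in Table 10. The sketch below is a simplified pure-Python re-implementation of the plateau logic for illustration, not the PyTorch source:

```python
# Simplified sketch of ReduceLROnPlateau in mode="max":
# reduce lr by `factor` after more than `patience` epochs without improvement.
class PlateauScheduler:
    def __init__(self, lr=1e-3, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:          # improvement: reset the counter
            self.best = metric
            self.bad_epochs = 0
        else:                           # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor  # cut the learning rate
                self.bad_epochs = 0

sched = PlateauScheduler()
for val_f1 in [0.60, 0.70] + [0.70] * 11:  # validation F1 plateaus at 0.70
    sched.step(val_f1)
print(sched.lr)  # one reduction: 0.001 -> 0.0001
```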

Figure 10 (b) examines the compute times per epoch. Models with embedding dimensions below 256 maintain consistent training times of approximately 2 seconds per epoch. However, as embedding dimensions increase, so do computation times. Given that larger models fail to deliver meaningful performance improvements, it is computationally efficient to select a smaller model that balances performance and training cost effectively.

5.2 Tokenizer Comparison

Code
df_bert_uncased = df.loc[
    (df['Comparison Type'] == 'Tokenizer') & (df['Tokenizer'] == 'bert-base-uncased')
]

df_bert_cased = df.loc[
    (df['Comparison Type'] == 'Tokenizer') & (df['Tokenizer'] == 'bert-base-cased')
]
Code
plot_vs_embedding_dim(df_bert_uncased, "Training F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_bert_cased, "Training F1 Score", ylim=1.0)
(a) Uncased Tokenizer Training F1 Score
(b) Cased Tokenizer Training F1 Score
Figure 11: Uncased vs Cased Tokenizer - Training F1 Score
Code
plot_vs_embedding_dim(df_bert_uncased, "Validation F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_bert_cased, "Validation F1 Score", ylim=1.0)
(a) Uncased Tokenizer F1 Score
(b) Cased Tokenizer F1 Score
Figure 12: Uncased vs Cased Tokenizer - F1 Score

Figure 11 and Figure 12 show no significant differences in performance between the cased and uncased tokenizers, as measured by training and validation F1 scores. However, since the original dataset retains case sensitivity, preserving this feature through a cased tokenizer aligns with the original characteristics of the data. This decision ensures that potentially meaningful information encoded in capitalization is retained.
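The mechanism behind the cased/uncased distinction can be shown with a toy whitespace tokenizer (this is an illustration only, not BERT’s WordPiece algorithm): an uncased tokenizer lowercases before building its vocabulary, so differently capitalized forms collapse into a single token, while a cased tokenizer keeps them distinct.

```python
# Toy illustration of casing: hypothetical example tweets, not from the dataset
tweets = ["Forest Fire near La Ronge", "fire trucks on scene", "FIRE spreading fast"]

def build_vocab(texts, cased):
    """Collect unique whitespace tokens, lowercasing first when uncased."""
    tokens = set()
    for text in texts:
        for word in text.split():
            tokens.add(word if cased else word.lower())
    return sorted(tokens)

cased_vocab = build_vocab(tweets, cased=True)
uncased_vocab = build_vocab(tweets, cased=False)
# Cased keeps "FIRE"/"Fire"/"fire" as separate entries; uncased merges them
print(len(cased_vocab), len(uncased_vocab))
```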

5.3 Data Level Comparison

Code
df_data_original = df.loc[
    (df['Comparison Type'] == 'Data Level') & (df['Data Level'] == 'original')
]

df_data_a1 = df.loc[
    (df['Comparison Type'] == 'Data Level') & (df['Data Level'] == 'a1')
]

df_data_a2 = df.loc[
    (df['Comparison Type'] == 'Data Level') & (df['Data Level'] == 'a2')
]
Code
plot_vs_embedding_dim(df_data_original, "Training F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_data_a1, "Training F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_data_a2, "Training F1 Score", ylim=1.0)
(a) Original Data - Training F1 Score
(b) A1 Data - Training F1 Score
(c) A2 Data - Training F1 Score
Figure 13: Training F1 Score by Data Levels
Code
plot_vs_embedding_dim(df_data_original, "Validation F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_data_a1, "Validation F1 Score", ylim=1.0)
plot_vs_embedding_dim(df_data_a2, "Validation F1 Score", ylim=1.0)
(a) Original Data - Validation F1 Score
(b) A1 Data - Validation F1 Score
(c) A2 Data - Validation F1 Score
Figure 14: Validation F1 Score by Data Levels


Figures Figure 13 and Figure 14 indicate no measurable difference in F1 scores across various data preprocessing levels. Given that altering the data level modifies how the tokenizer processes the input, it introduces additional complexity without yielding performance benefits. Based on this finding we will use the original unaltered data as input to subsequent models.

5.4 1-3 Embedding Layers

Code
df_data_one_layer = df.loc[
    (df['Comparison Type'] == 'Layers') & (df['Embedding Layers'] == 1)
]

df_data_two_layer = df.loc[
    (df['Comparison Type'] == 'Layers') & (df['Embedding Layers'] == 2)
]

df_data_three_layer = df.loc[
    (df['Comparison Type'] == 'Layers') & (df['Embedding Layers'] == 3)
]
Code
plot_vs_embedding_dim(df_data_one_layer, "Training F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_two_layer, "Training F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_three_layer, "Training F1 Score", ylim=1.0)
(a) 1 Embedding Layer - Training F1 Score
(b) 2 Embedding Layers - Training F1 Score
(c) 3 Embedding Layers - Training F1 Score
Figure 15: Training F1 Score by Embedding Layers
Code
plot_vs_embedding_dim(df_data_one_layer, "Validation F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_two_layer, "Validation F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_three_layer, "Validation F1 Score", ylim=1.0)
(a) 1 Embedding Layer - Validation F1 Score
(b) 2 Embedding Layers - Validation F1 Score
(c) 3 Embedding Layers - Validation F1 Score
Figure 16: Validation F1 Score by Embedding Layers

5.5 4-6 Embedding Layers

Code
df_data_four_layer = df.loc[
    (df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 4)
]

df_data_five_layer = df.loc[
    (df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 5)
]

df_data_six_layer = df.loc[
    (df['Comparison Type'] == 'Big Layers') & (df['Embedding Layers'] == 6)
]
Code
plot_vs_embedding_dim(df_data_four_layer, "Training F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_five_layer, "Training F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_six_layer, "Training F1 Score", ylim=1.0)
(a) 4 Embedding Layers - Training F1 Score
(b) 5 Embedding Layers - Training F1 Score
(c) 6 Embedding Layers - Training F1 Score
Figure 17: Training F1 Score by Embedding Layers
Code
plot_vs_embedding_dim(df_data_four_layer, "Validation F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_five_layer, "Validation F1 Score", show_legend=False, ylim=1.0)
plot_vs_embedding_dim(df_data_six_layer, "Validation F1 Score", ylim=1.0)
(a) 4 Embedding Layers - Validation F1 Score
(b) 5 Embedding Layers - Validation F1 Score
(c) 6 Embedding Layers - Validation F1 Score
Figure 18: Validation F1 Score by Embedding Layers

Increasing the number of embedding layers has a subtle but observable effect on the validation F1 score. Models with more embedding layers inherently introduce dropout between the stacked layers, which helps mitigate overfitting by regularizing the network. This regularization contributes to greater training stability, as evidenced by reduced fluctuations in F1 scores across epochs. While the performance gains are marginal, the enhanced stability offered by deeper models may provide a slight advantage for tasks requiring consistent and reliable predictions.

5.6 Final Models

Code
df_final = df.loc[
    df['Comparison Type'] == 'Final'
]
Code
plot_vs_embedding_dim(df_final, "Training F1 Score", hue = "Embedding Layers", show_legend=False)
plot_vs_embedding_dim(df_final, "Validation F1 Score", hue = "Embedding Layers")
(a) Training F1 Score
(b) Validation F1 Score
Figure 19: Final Model F1 Scores

In the final output models, we observe a steady increase in the training F1 score across all models, while the validation F1 score stabilizes after ~30 epochs. Notably, models with 2 and 6 embedding layers achieve higher validation F1 scores compared to those with 4 embedding layers. This suggests that embedding layer depth plays a nuanced role in model performance, with certain configurations better capturing the underlying patterns in the data. All models demonstrate the ability to fit the input data effectively and achieve stability within the specified 50 epochs.

6 Results

6.1 Kaggle Screenshot

Figure 20: Kaggle Results

6.2 Results Analysis

Code
# Keep only the final epoch for each model
df_final = df_final.loc[df_final['Epoch'] == 50]
# Public leaderboard scores from Kaggle, one per final model (2, 4, 6 embedding layers)
public_scores = [0.75973, 0.73490, 0.73337]
df_final["Public Score"] = public_scores
Code
# Reshape the data for a grouped bar plot
df_melted = df_final.melt(
    id_vars="Embedding Layers",
    value_vars=["Public Score"],
    var_name="Score Type",
    value_name="Score",
)

# Create the grouped bar plot
plt.figure(figsize=(10, 3))
ax = sns.barplot(data=df_melted, x="Embedding Layers", y="Score", hue="Score Type")
for container in ax.containers:
    ax.bar_label(container, fmt="%.3f")
plt.ylim((0, 0.85))
plt.title("Scores by Embedding Layers")
plt.show()
Figure 21: Final Model Scores
Code
pd.set_option('display.max_rows', 500)
df_melted
Table 9: Final Model Scores
Embedding Layers Score Type Score
0 2 Public Score 0.75973
1 4 Public Score 0.73490
2 6 Public Score 0.73337


In Figure 21 and Table 9, the final Kaggle scores for each model are visualized and tabulated. These scores represent the performance of the models on the public leaderboard, providing a benchmark for comparison. Based on these results, the model with 2 embedding layers achieved the highest public score, suggesting that simpler architectures may offer better generalization for this task. This outcome aligns with observations from the validation F1 scores, where models with fewer embedding layers demonstrated comparable or superior performance relative to more complex configurations.

6.2.1 Final Model Specifications

Code
from IPython.display import display, Markdown

df_final['Start Learning Rate'] = 0.001
df_final['End Learning Rate'] = df_final['Learning Rate']

df_specs = (
    df_final[
        [
            "Learning Optimization",
            "Start Learning Rate",
            "End Learning Rate",
            "Specified Epochs",
            "Batch Size",
            "Data Level",
            "Vocab Size",
            "Tokenizer",
            "Embedding Dimensions",
            "Hidden Dimensions",
        ]
    ]
    .iloc[0]
    .T
)

# Convert series into df
df_specs = df_specs.reset_index()
df_specs.columns = ["Specification", "Value"]

display(Markdown(df_specs.to_markdown(index=False)))
Table 10: Final Model Specifications
Specification Value
Learning Optimization default
Start Learning Rate 0.001
End Learning Rate 1e-06
Specified Epochs 50
Batch Size 256
Data Level original
Vocab Size 28996
Tokenizer bert-base-cased
Embedding Dimensions 128
Hidden Dimensions 256

6.2.2 Additional Final Model Statistics

Code
# Reshape the data for a grouped bar plot
df_melted = df_final.melt(
    id_vars="Embedding Layers",
    value_vars=["Validation Accuracy Score", "Validation Precision Score", "Validation Recall Score"],
    var_name="Score Type",
    value_name="Score",
)

# Create the grouped bar plot
plt.figure(figsize=(10, 3))
ax = sns.barplot(data=df_melted, x="Embedding Layers", y="Score", hue="Score Type")
for container in ax.containers:
    ax.bar_label(container, fmt="%.3f")
plt.ylim((0, 0.85))
plt.title("Scores by Embedding Layers")
plt.show()
Figure 22: Final Model Scores - Validation Accuracy, Precision, and Recall
Code
pd.set_option('display.max_rows', 500)
df_melted
Table 11: Final Model Additional Statistics
Embedding Layers Score Type Score
0 2 Validation Accuracy Score 0.768221
1 4 Validation Accuracy Score 0.741300
2 6 Validation Accuracy Score 0.749836
3 2 Validation Precision Score 0.781197
4 4 Validation Precision Score 0.737624
5 6 Validation Precision Score 0.730475
6 2 Validation Recall Score 0.670088
7 4 Validation Recall Score 0.655425
8 6 Validation Recall Score 0.699413

In addition to the public Kaggle scores, the models were validated using accuracy, precision, and recall, shown in Figure 22 and Table 11. These metrics provide a comprehensive view of model performance and highlight differences in how the models handle the classification task.

  1. Accuracy:
    • The model with 2 embedding layers achieved the highest validation accuracy (0.768221), outperforming both the 4-layer and 6-layer models.
    • While the accuracy decreases slightly with an increase in embedding layers, the difference is small.
  2. Precision:
    • Precision is highest for the 2-layer model (0.781197), indicating its strength in minimizing false positives.
    • As the number of embedding layers increases, precision declines, with the 6-layer model scoring the lowest (0.730475).
  3. Recall:
    • Recall improves with more embedding layers, with the 6-layer model achieving the highest score (0.699413). This suggests that models with more embedding layers are better at capturing true positives, albeit at the expense of increased false positives.
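These precision and recall values combine into the F1 score cited in the conclusion. As a quick cross-check, assuming the standard harmonic-mean definition of F1:

```python
# F1 is the harmonic mean of precision and recall
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Validation precision and recall of the 2-embedding-layer model (Table 11)
f1 = f1_score(0.781197, 0.670088)
print(round(f1, 6))  # ≈ 0.721389, the F1 score reported in the conclusion
```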

7 Conclusion

7.1 Project Summary

This project explored the classification of disaster-related tweets using Recurrent Neural Networks (RNNs) built with PyTorch. Various configurations, including embedding dimensions, tokenizer choices, and data cleaning levels, were systematically evaluated to tune hyperparameters. The results demonstrated that a 2-layer embedding model achieved the highest overall performance, balancing validation accuracy (0.768221), precision (0.781197), recall (0.670088), and F1 score (0.721389), with a public Kaggle evaluation score of 0.75973. The findings underscore the value of simplicity in model architecture, with more complex configurations yielding diminishing returns.

7.2 Lessons Learned

  1. Model Architecture and Complexity: Increasing embedding layers and introducing dropout led to marginal stability improvements but did not significantly enhance validation F1 scores. Simpler architectures performed comparably or better in most cases.

  2. Tokenizer and Data Cleaning: The cased tokenizer demonstrated equivalent performance to the uncased version, justifying the retention of the original data’s case sensitivity. Altering data levels disrupted tokenizer behavior without providing measurable benefits.

  3. Layer Size: Models with fewer embedding layers converged faster and triggered the learning rate scheduler earlier, suggesting efficiency advantages in training. All models reached stability within 50 epochs, highlighting the importance of early stopping to reduce computational overhead.

  4. Evaluation Metrics: The F1 score proved to be the most informative metric for this dataset, balancing precision and recall effectively. Relying solely on accuracy or public Kaggle scores would have overlooked critical trade-offs in model performance.

7.3 Areas for Improvement / Future Work

  1. Feature Engineering: Incorporating additional features, such as sentiment analysis scores or tweet metadata, could enhance the model’s ability to capture nuanced patterns in the data.

  2. Architectural Changes: Future experiments could include transformer-based architectures, such as BERT or GPT, to assess whether advanced models outperform RNNs for this task.

  3. Generalization Analysis: While this project focused on disaster-related tweets, extending the dataset to include non-disaster events could help test the model’s generalization capabilities across broader text classification domains.

References

Ansel, Jason, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, et al. 2024. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation.” In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24). ACM. https://doi.org/10.1145/3620665.3640366.
Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc. https://www.nltk.org/book/.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
Howard, Addison, devrishi, Phil Culliton, and Yufeng Guo. 2019. “Natural Language Processing with Disaster Tweets.” https://kaggle.com/competitions/nlp-getting-started.
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Perric Cistac, et al. 2022. Transformers: State-of-the-Art Natural Language Processing.” Zenodo. https://doi.org/10.5281/zenodo.7391177.